High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors

Authors

  • Peng Du
  • Piotr Luszczek
  • Jack J. Dongarra
Abstract

In the multi-petaflop era of supercomputing, the number of computing cores is growing exponentially. However, as integrated circuit technology scales below 65 nm, the critical charge required to flip a gate or a memory cell has been dangerously reduced, leading to a higher rate of cosmic-radiation-induced soft errors. Soft errors threaten computing systems by producing silent data corruption, which is hard to detect and correct. Current research on soft-error resilience for dense linear solvers offers limited capability on large-scale computing systems and suffers from both soft errors and round-off error due to floating-point arithmetic. This work proposes a fault-tolerant algorithm that can recover the solution of a dense linear system Ax = b from multiple spatial and temporal soft errors. Experimental results on the Kraken supercomputer confirm the scalable performance of the proposed fault-tolerance functionality and negligible overhead in solution recovery.
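To make the idea of detecting and correcting silent data corruption concrete, the following is a minimal sketch of the classic row/column checksum scheme for a dense matrix. It is not the algorithm proposed in the paper, which the abstract does not describe in detail. The function names `checksums` and `locate_and_fix`, the tolerance `tol`, and the sample matrix are all assumptions made for illustration, and the sketch handles only a single corrupted entry.

```python
def checksums(A):
    """Return (row_sums, col_sums) for a dense matrix stored as a list of rows."""
    row_sums = [sum(row) for row in A]
    col_sums = [sum(col) for col in zip(*A)]
    return row_sums, col_sums


def locate_and_fix(A, row_sums, col_sums, tol=1e-9):
    """Detect and repair one corrupted entry of A in place (hypothetical helper).

    row_sums and col_sums are checksums saved before the corruption.
    Returns the (row, col) index of the repaired entry, or None if the
    matrix still matches its saved checksums.
    """
    new_rows, new_cols = checksums(A)
    bad_rows = [i for i, (s, t) in enumerate(zip(row_sums, new_rows)) if abs(s - t) > tol]
    bad_cols = [j for j, (s, t) in enumerate(zip(col_sums, new_cols)) if abs(s - t) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("this sketch handles exactly one corrupted entry")
    i, j = bad_rows[0], bad_cols[0]
    # The row checksum mismatch equals the size of the injected error.
    A[i][j] += row_sums[i] - new_rows[i]
    return i, j


# Illustrative data (assumed, not from the paper).
A = [[4.0, 1.0, 2.0],
     [1.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
saved_rows, saved_cols = checksums(A)

A[1][2] += 8.0  # simulate a soft error silently corrupting one entry
repaired_at = locate_and_fix(A, saved_rows, saved_cols)
```

The mismatching row checksum and column checksum intersect at the corrupted entry, and the size of the row mismatch gives the correction. With floating-point data the comparison needs a tolerance, which is where round-off error and soft errors become hard to tell apart, as the abstract notes.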

Similar resources

High Performance Linear System Solver with Resilience to Multiple Soft Errors

In the multi-petaflop era of supercomputing, the number of computing cores is growing exponentially. However, with integrated circuit technology scaling below 65 nm, the critical charge required to flip a gate or a memory cell is dangerously reduced. Combined with higher vulnerability to cosmic radiation, soft errors are expected to become all but inevitable for modern supercomputer syst...


An Evolutionary Method for Improving the Reliability of Safety-critical Robots against Soft Errors

Nowadays, robots account for a large part of our lives, to the point that it is impossible for us to do many tasks without them. Increasingly, the application of robots is developing fast and their functions are becoming more sensitive and complex. One of the important requirements of robot use is reliable software operation. To enhance reliability, it is necessary to design the fault toler...


SASSIFI: Evaluating Resilience of GPU Applications

As GPUs become more pervasive in both scalable high-performance computing systems and safety-critical embedded systems, evaluating and analyzing their resilience will grow increasingly important. As soft errors, such as those caused by high-energy particle strikes, form an important fraction of in-field hardware errors, GPU designers must develop tools and techniques to understand the effect of...


Efficient Parallel Solvers for Large Dense Systems of Linear Interval Equations

Verified solvers for dense linear (interval-)systems require a lot of resources, both in terms of computing power and memory usage. Computing a verified solution of large dense linear systems (dimension n > 10000) on a single machine quickly approaches the limits of today’s hardware. Therefore, an efficient parallel verified solver for distributed memory systems is needed. In this work we prese...


An efficient distributed randomized solver with application to large dense linear systems

Randomized algorithms are gaining ground in high performance computing applications as they have the potential to outperform deterministic methods, while still providing accurate results. In this paper, we propose a randomized algorithm for distributed multicore architectures to efficiently solve large dense symmetric indefinite linear systems that are encountered, for instance, in parameter es...


Publication date: 2012